From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips
Short internet video clips such as vines exhibit a far wilder distribution than traditional video datasets. In this paper, we focus on the problem of unsupervised action classification in wild vines using traditional labeled datasets. To this end, we use a simple domain adaptation strategy based on data augmentation. We utilise the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabelled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together. Additionally, we utilise a multi-modal representation that incorporates the noisy semantic information available in the form of hashtags. We show the effectiveness of this simple adaptation technique on a test set of vines and achieve notable improvements in performance.
Comment: 9 pages, GCPR, 201
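A minimal sketch of the incremental augmentation loop described above, assuming precomputed video features and L2-normalised word2vec class vectors; the ridge-regression mapping and all names are illustrative, not the paper's exact formulation:

```python
import numpy as np

def fit_mapping(X, Y, lam=1.0):
    """Ridge-regression embedding from visual features X to word2vec targets Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def adapt(Xs, Ys, Xt, class_vecs, rounds=3, k=5):
    """Incrementally augment the labeled source with confident target samples
    and refit the embedding each round (same targets may be re-selected; this
    is a simplification)."""
    X, Y = Xs.copy(), Ys.copy()
    for _ in range(rounds):
        W = fit_mapping(X, Y)
        E = Xt @ W                      # embed target videos in word2vec space
        sims = E @ class_vecs.T         # similarity to each class vector
        pred, conf = sims.argmax(1), sims.max(1)
        top = np.argsort(-conf)[:k]     # k most confident target samples
        X = np.vstack([X, Xt[top]])
        Y = np.vstack([Y, class_vecs[pred[top]]])   # pseudo-labels
    return fit_mapping(X, Y)
```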
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
We propose a weakly-supervised framework for action labeling in video, where
only the order of occurring actions is required during training time. The key
challenge is that the per-frame alignments between the input (video) and label
(action) sequences are unknown during training. We address this by introducing
the Extended Connectionist Temporal Classification (ECTC) framework to
efficiently evaluate all possible alignments via dynamic programming and
explicitly enforce their consistency with frame-to-frame visual similarities.
This protects the model from distraction by visually inconsistent or degenerate alignments without the need for frame-level temporal supervision. We further extend our framework to the semi-supervised case, where a few frames are sparsely annotated in a video. With less than 1% of labeled frames per video, our method outperforms existing semi-supervised approaches and achieves performance comparable to that of fully supervised approaches.
Comment: To appear in ECCV 201
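The dynamic-programming evaluation of all alignments builds on the standard CTC forward recursion, which ECTC extends with frame-to-frame visual-similarity weights. A plain, unextended CTC forward pass as a sketch:

```python
import numpy as np

def ctc_forward(log_probs, labels, blank=0):
    """Log total probability of all alignments of `labels` to T frames
    (standard CTC; ECTC additionally reweights alignments by visual similarity).
    log_probs: (T, vocab) per-frame log class probabilities."""
    T = log_probs.shape[0]
    ext = [blank]
    for l in labels:
        ext += [l, blank]               # interleave blanks: b l1 b l2 b ...
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                     # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])         # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])         # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])
```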
Data Mining for Action Recognition
© Springer International Publishing Switzerland 2015. In recent years, dense trajectories have been shown to be an efficient representation for action recognition and have achieved state-of-the-art results on a variety of increasingly difficult datasets. However, while the features have greatly improved recognition scores, the training process and machine learning used have not, in general, deviated from the SVM approach inherited from object recognition, despite the increase in the quantity and complexity of the features used. This paper improves the performance of action recognition through two data mining techniques: Apriori association rule mining and Contrast Set Mining. These techniques are ideally suited to action recognition, and to dense trajectory features in particular, because they can exploit large amounts of data to identify far shorter discriminative subsets of features called rules. Experimental results on one of the most challenging datasets, Hollywood2, outperform the current state of the art.
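As an illustration of the first technique, a minimal Apriori frequent-itemset miner (support counting plus candidate generation and subset pruning); the rule-generation step and Contrast Set Mining are omitted:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: returns frequent itemsets (frozensets) -> support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    current = {frozenset([i]) for i in items}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        counts = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(counts)
        # join frequent k-itemsets into (k+1)-itemset candidates
        keys = list(counts)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        # prune any candidate with an infrequent k-subset
        current = {c for c in current
                   if all(frozenset(s) in counts for s in combinations(c, k))}
        k += 1
    return frequent
```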
Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition
Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty of acquiring and learning on large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods that use hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatio-temporal features classified by supervised deep networks. As we show in our experiments on five popular benchmarks for action recognition, our hybrid model combines the best of both worlds: it is data efficient (trained on 150 to 10000 short clips) and yet improves significantly on the state of the art, including recent deep models trained on millions of manually labelled images and videos.
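A toy sketch of the hybrid idea, with PCA whitening standing in for the paper's unsupervised representation and a single softmax layer standing in for the supervised deep network (both stand-ins are illustrative simplifications):

```python
import numpy as np

def pca_whiten(X, dim):
    """Unsupervised step: project onto top principal components and whiten."""
    Xc = X - X.mean(0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T / (s[:dim] / np.sqrt(len(X) - 1))

def train_softmax(Z, y, classes, lr=0.1, epochs=300):
    """Supervised step: one softmax layer trained by gradient descent
    on the unsupervised codes."""
    W = np.zeros((Z.shape[1], classes))
    Y = np.eye(classes)[y]
    for _ in range(epochs):
        A = Z @ W
        A -= A.max(1, keepdims=True)            # numerical stability
        P = np.exp(A)
        P /= P.sum(1, keepdims=True)
        W -= lr * Z.T @ (P - Y) / len(Z)        # cross-entropy gradient
    return W
```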
Globally Continuous and Non-Markovian Crowd Activity Analysis from Videos
Automatically recognizing activities in video is a classic problem in vision and helps to understand behaviors, describe scenes and detect anomalies. We propose an unsupervised method for these purposes. Given video data, we discover recurring activity patterns that appear, peak, wane and disappear over time. By using non-parametric Bayesian methods, we learn coupled spatial and temporal patterns with minimal prior knowledge. To model the temporal changes of patterns, previous works compute Markovian progressions or locally continuous motifs, whereas we model time in a globally continuous and non-Markovian way. Visually, the patterns depict flows of major activities. Temporally, each pattern has its own unique appearance-disappearance cycles. To compute compact pattern representations, we also propose a hybrid sampling method. By combining these patterns with detailed environment information, we interpret the semantics of activities and report anomalies. Our method also fits the data better and detects anomalies that were previously difficult to detect.
Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts
In this paper, we present a new method that estimates the pose of a human body and identifies its action from a single static image. This is a challenging task due to the high degrees of freedom of body poses and the lack of any motion cues. Specifically, we build a pool of pose experts, each of which individually models a particular type of articulation for a group of human bodies with similar poses or semantics (actions). We investigate two ways to construct these pose experts and show that this method leads to improved pose estimation performance under difficult conditions. Furthermore, in contrast to the previous wisdom of combining the output of each pose expert for action recognition using methods such as majority voting, we propose a flexible strategy that adaptively integrates them in a discriminative framework, allowing each pose expert to adjust its role in action prediction according to its specificity when facing different action types. In particular, the spatial relationship between the part locations estimated by each expert is encoded in a graph structure, capturing both the non-local and local spatial correlation of the body shape. Each graph is then treated as a separate group, on which an overall group-sparse constraint is imposed to train the prediction model, with extra weight added according to the confidence of the corresponding expert. Our experiments on a challenging web dataset show state-of-the-art results and demonstrate that our method effectively improves the tolerance of our system to imperfect pose estimation.
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
Zero-Shot Learning (ZSL) promises to scale visual recognition by bypassing
the conventional model training requirement of annotated examples for every
category. This is achieved by establishing a mapping connecting low-level features and a semantic description of the label space, referred to as a visual-semantic mapping, on auxiliary data. Reusing the learned mapping to project target videos into an embedding space thus allows novel classes to be recognised by nearest neighbour inference. However, existing ZSL methods suffer
from auxiliary-target domain shift intrinsically induced by assuming the same
mapping for the disjoint auxiliary and target classes. This compromises the
generalisation accuracy of ZSL recognition on the target data. In this work, we
improve the ability of ZSL to generalise across this domain shift in both
model- and data-centric ways by formulating a visual-semantic mapping with
better generalisation properties and a dynamic data re-weighting method to
prioritise auxiliary data that are relevant to the target classes.
Specifically: (1) We introduce a multi-task visual-semantic mapping to improve
generalisation by constraining the semantic mapping parameters to lie on a
low-dimensional manifold, (2) We explore prioritised data augmentation by
expanding the pool of auxiliary data with additional instances weighted by
relevance to the target domain. The proposed new model is applied to the
challenging zero-shot action recognition problem to demonstrate its advantages
over existing ZSL models.
Comment: Published in ECCV 201
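The baseline ZSL pipeline the paper builds on can be sketched as follows, assuming a simple least-squares visual-semantic mapping (the paper's multi-task mapping and prioritised data re-weighting are not reproduced here):

```python
import numpy as np

def zsl_pipeline(X_aux, S_aux, X_tgt, unseen_class_vecs):
    """Fit a least-squares visual-to-semantic mapping on auxiliary (seen-class)
    data, project target videos through it, and label each by cosine nearest
    neighbour among the *unseen* class embeddings."""
    W, *_ = np.linalg.lstsq(X_aux, S_aux, rcond=None)
    E = X_tgt @ W
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    C = unseen_class_vecs / (np.linalg.norm(unseen_class_vecs,
                                            axis=1, keepdims=True) + 1e-12)
    return (E @ C.T).argmax(1)      # index into the unseen class list
```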
Global Regularizer and Temporal-aware Cross-entropy for Skeleton-based Early Action Recognition
In this paper, we propose a new approach that recognizes the class label of an action before the action is fully performed, based on skeleton sequences. Compared to action recognition, which uses fully observed action sequences, early action recognition with partial sequences is much more challenging, mainly because: (1) the global information of a long-term action is not available in a partial sequence, and (2) the partial sequences at different observation ratios of an action contain a number of sub-actions with diverse motion information. To address the first challenge, we introduce a global regularizer to learn a hidden feature space in which the statistical properties of the partial sequences are similar to those of the full sequences. To address the second challenge, we introduce a temporal-aware cross-entropy and achieve better prediction performance. We evaluate the proposed method on three challenging skeleton datasets. Experimental results show the superiority of the proposed method for skeleton-based early action recognition.
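One plausible reading of the temporal-aware idea is to scale each partial sequence's cross-entropy by its observation ratio, so that nearly complete observations dominate training; the paper's exact weighting may differ, so this is only a sketch:

```python
import numpy as np

def temporal_aware_ce(logits, label, obs_ratio):
    """Illustrative temporal-aware cross-entropy: the standard cross-entropy
    of a partial sequence, scaled by its observation ratio in (0, 1].
    (The paper's precise weighting scheme may differ from this sketch.)"""
    z = logits - logits.max()                  # numerical stability
    log_p = z - np.log(np.exp(z).sum())        # log-softmax
    return -obs_ratio * log_p[label]
```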
Dynamic behavior analysis via structured rank minimization
Human behavior and affect are inherently dynamic phenomena involving the temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues, including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear time-invariant system when behavioral cues act as the input (e.g., continuous rather than discrete annotations of dimensional affect). Here we study the learning of such a dynamical system under real-world conditions, namely in the presence of noisy behavioral cue descriptors and possibly unreliable annotations, by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on three distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and hence evidence the robustness and effectiveness of the proposed approach.
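The proximal step at the heart of most rank-minimization solvers is singular value thresholding, shown here alongside a Hankel-structured matrix of the kind used to encode linear time-invariant dynamics (a generic sketch, not the paper's specific algorithm):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * nuclear
    norm, the workhorse step inside most rank-minimization solvers."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
    return (U * s_thr) @ Vt

def hankel(x, rows):
    """Hankel matrix of a 1-D series; low Hankel rank corresponds to the
    series being generated by a low-order linear time-invariant system."""
    cols = len(x) - rows + 1
    return np.array([x[i:i + cols] for i in range(rows)])
```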